Add back handling for use_delimiter, making globs with multiple * faster #218
base: main
Conversation
force-pushed from 6906e6b to 269606a
Thanks - bringing back the delimiter is fine by me (although I don't fully understand what this is doing that causes the speed-up). One thing: when I run this locally I get an error with:
`set s3_region='us-east-2';`
`FROM glob('s3://coiled-datasets/timeseries/*-years/parquet/*.parquet');`
There seems to be something going wrong with the concatenation of URLs.
I can't reproduce this; it smells like you might have credentials or a region enabled? That is a longstanding (and weird) issue. Can you check?
As to why this is faster: basically, in the case of multiple folders, say that you glob on a pattern with a `*` at more than one path level. Listing with a delimiter resolves one level at a time, descending only into the folders that actually match, instead of listing every object under the common prefix and filtering client-side.
Now that I write it down, it can also be more expensive, since every level gets an extra listing; the advantage is that those per-folder listings are parallelizable, while the single full folder scan is not.
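To make the trade-off concrete, here is a toy sketch of the two strategies in Python. This is not the extension's actual implementation: `list_objects`, the key names, the CSV "siblings", and the page size are all made-up stand-ins for S3 `ListObjectsV2` and the bucket layout; the point is only to count requests.

```python
import fnmatch
import math

# Toy object store: a flat list of keys, as in S3 (no real folders).
# Key names are made up, loosely modeled on the bucket from the thread.
KEYS = [f"timeseries/{y}-years/parquet/part-{i}.parquet"
        for y in (1, 5, 10) for i in range(4)]
KEYS += [f"timeseries/{y}-years/csv/part-{i}.csv"
         for y in (1, 5, 10) for i in range(30)]

PAGE_SIZE = 10          # pretend each list request returns at most 10 entries
calls = {"n": 0}

def list_objects(prefix, delimiter=None):
    """Stand-in for S3 ListObjectsV2. With a delimiter, keys are rolled
    up into one level of 'common prefixes'; pagination is counted as
    one request per page of results."""
    matching = [k for k in KEYS if k.startswith(prefix)]
    if delimiter is None:
        result = matching
    else:
        rolled = set()
        for k in matching:
            rest = k[len(prefix):]
            cut = rest.find(delimiter)
            rolled.add(prefix + (rest if cut < 0 else rest[:cut + 1]))
        result = sorted(rolled)
    calls["n"] += max(1, math.ceil(len(result) / PAGE_SIZE))
    return result

def glob_with_delimiter(pattern):
    """Resolve the glob one '/'-separated segment at a time; only
    segments containing a wildcard trigger a delimiter listing, while
    literal segments just extend the prefix for free."""
    parts = pattern.split("/")
    prefixes = [""]
    for i, part in enumerate(parts):
        tail = "" if i == len(parts) - 1 else "/"
        nxt = []
        for p in prefixes:
            if "*" not in part:
                nxt.append(p + part + tail)
                continue
            for entry in list_objects(p, delimiter="/"):
                if fnmatch.fnmatch(entry, p + part + tail):
                    nxt.append(entry)
        prefixes = nxt
    return prefixes

def glob_flat(pattern):
    """Old behaviour: list everything under the longest literal prefix
    in one (paginated) scan, then filter client-side."""
    prefix = pattern.split("*")[0]
    return [k for k in list_objects(prefix) if fnmatch.fnmatch(k, pattern)]

pattern = "timeseries/*-years/parquet/*.parquet"

calls["n"] = 0
hierarchical = glob_with_delimiter(pattern)
delim_requests = calls["n"]   # 1 top-level listing + 1 per matching folder

calls["n"] = 0
flat = glob_flat(pattern)
flat_requests = calls["n"]    # pages through every object, csv included

assert sorted(hierarchical) == sorted(flat)
print(delim_requests, flat_requests)  # → 4 11
```

In this toy setup the delimiter walk never descends into the `csv/` folders, so it issues 4 requests where the flat scan pages through all 102 objects in 11; and the per-folder listings at each level are independent, so they can be issued in parallel.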
force-pushed from 1651893 to 336ae6f
force-pushed from ee13c6b to cd8c3ab
Test is:
that now takes about 2 s and 4 requests (vs 42 s and 46 requests, the behaviour of both main and v1.4-andium).
Builds on #216, attempting to solve the comment at #216 (comment).
I think this might be slightly worse IF a single glob is present in the middle of the path, but not by much, and the gains are unbounded in the case of multiple globs.